Primary Goal:
Perform QC analysis on a single cell data set using the R Seurat package.
Perform QC analysis on a single cell data set using the R Seurat package.
Use the Athena terminal to set up your lab directory and obtain needed files.
# create and navigate into a new directory for this lab
mkdir scRNASeq
cd scRNASeq
# Copy feature count directory
cp -r /lustre/home/cctr691/scRNASeq/filtered_feature_bc_matrix_WHIM2_106361_HumanOnly .
# Copy supporting files
cp /lustre/home/cctr691/scRNASeq/MitoCodingGenes13_human.txt .
cp /lustre/home/cctr691/scRNASeq/UCD52CR_107086_web_summary.html .
The majority of this lab will be completed in the Rmarkdown template EXCEPT this first section.
Download the “UCD52CR_107086_web_summary.html” file to your laptop and open it in a web browser, then answer the first set of questions in the RMarkdown template.
Q1) Was this sample aligned to a SINGLE or MULTIPLE genomes?
2 genomes
Q2) With the knowledge that we expect about 5000 cells from this sample, list at least 2 red flags identified by this report. Be sure to explore both tabs.
Red Flag 1: median genes per cell mm10
Red Flag 2: reads mapped to intergenic regions mm10
Follow the instructions and code to remove dead cells, generate violin plots, cluster, and create tSNE plots of the sample data.
The sample data you are using is publicly available on GEO (Gene Expression Omnibus) in the Series GSE174391, only it is merged with other samples. For this lab I am providing you with the single WHIM2 PDX sample that has already had the mouse cells removed (so you only have human data).
Also, to get help using any command, type “?” followed by the command to see the documentation. E.g. “?plot”
Follow the steps below and answer the questions. This lab is based off of, but not exactly the same as, the Seurat introductory tutorial. Feel free to explore that tutorial for more in-depth information.
Edit the setwd() command to point to YOUR lab directory!
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Load the WHIM2 dataset into Seurat:
whim.data <- Read10X(data.dir = "./scRNASeq/filtered_feature_bc_matrix_WHIM2_106361_HumanOnly")
#Initialize the Seruat Object:
whim <- CreateSeuratObject(counts = whim.data, project = "whim2", min.cells = 3, min.features = 200)
#View a summary of the object:
whim
## An object of class Seurat
## 19612 features across 2575 samples within 1 assay
## Active assay: RNA (19612 features, 0 variable features)
## 1 layer present: counts
In the environment tab of the RStudio interface, click on the “whim” object to expand and explore it. Look for the “meta.data” section. This is the most important section as it is where your new annotations are stored, and it tells you the name of annotations added by clustering and other processing.
Q3) What metadata elements are listed in this section?
identity, count, and feature
Q4) I told you this is 1 PDX sample, but the summary of the Seurat object says differently. How many samples does this object have? What is this actually telling you? What does the 19,612 number represent?
2575 cells and the 19,612 represents the number the number of genes.
# Load in the mitochondrial gene IDs:
mitogene_ids <- read.delim("./scRNASeq/MitoCodingGenes13_human.txt", header = FALSE, stringsAsFactors = FALSE)[[1]]
# Add the percentage of mito expression as an annotation to your Seurat object:
whim[["percent.mt"]] <- PercentageFeatureSet(whim, features = mitogene_ids)
# Create a Violin plot of the mito expression, nFeature and nCount data
# These last 2 were already in the dataset by default, so we didn’t need to add them.
VlnPlot(whim, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = .4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
Q5) What is a reasonable filtering criteria for the percent of mitochondria expression?
use a percent cut off for percent mito expression.
Sub set your cells to remove those that do not pass your mitochondrial cutoff. Make sure to replace the “ADD_YOUR_CUTOFF_HERE” with the appropriate expression (i.e. “> x” or “< x” where “x” is the threshold you specified in Q5).
# subset the seurat object for only cells passing your cutoff and print out the new violin plot
whim_filtered <- subset(whim, subset = percent.mt <25)
# print new violin plot
VlnPlot(whim_filtered, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = .4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
Note that we now have two (2) Seurat data sets, “whim” contains the full unfiltered dataset, and “whim_filtered” contains the filtered dataset.
Q6) How many cells survived the filtering? Print out the new “whim_filtered” object to figure this out.
1314
Q7) Compare the 2 plots, before and after. What can you conclude about the cells that did not make the mitochondrial cutoff in terms of nFeature and nCount? If you want to print both plots in the same R code chunk to compare more easily you can do that.
lots of mitochondria, cells that did not make the cut off were removed and made prettier plots or more clean plots.
You can continue to filter on nFeature and nCount, and I generally do, but for the purposes of this lab we will leave the dataset as-is and move on to normalization and clustering.
Before we can cluster the data and create those cool tSNE and UMAP visualizations, we need to normalize and scale the data, and run a PCA analysis. However, this can take a long time if we use all genes. Thus, we will reduce the set used to the most variable genes (don’t worry, the rest are still there:).
# Using the “whim_filtered” Seurat data set, normalize the data.
whim_filtered <- NormalizeData(whim_filtered)
## Normalizing layer: counts
# Before creating the visualizations, we need to reduce the dataset down to only
# those genes with the most variance. These will be our top 2000 most variable features.
whim_filtered <- FindVariableFeatures(whim_filtered, selection.method = "vst", nfeatures = 2000)
## Finding variable features for layer counts
top10 <- head(VariableFeatures(whim_filtered), 10)
# Plot the top10 most variable genes as a scatter plot and mark the top 10
plot1 <- VariableFeaturePlot(whim_filtered)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Q8) Which gene is the most variable?
SCGB2A2 shows the most variance
What does scaling do? * Shifts the expression of each gene, so that the mean expression across cells is 0 * Scales the expression of each gene, so that the variance across cells is 1 * This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
Principle Component Analysis is a dimension reduction technique that transforms large data sets with a lot of dimensions into a data set with fewer dimensions, but still retains the majority of the information present, such as variability, in the initial data set. In single cell RNASeq it is used as a first step to identify the most informative principle components that should be used for data visualization by UMAP and tSNE.
# scale data
whim_filtered <- ScaleData(whim_filtered)
## Centering and scaling data matrix
# run PCA
whim_filtered <- RunPCA(whim_filtered, features = VariableFeatures(object = whim_filtered))
## PC_ 1
## Positive: MGP, SCGB3A1, PHLDA1, LBH, SCGB2A2, EMP1, CX3CL1, PLCG2, CEBPD, PXDC1
## PIP, KRT16, ORAI3, ZG16B, INHBA, LINC02613, SCGB2A1, DNAH11, DHRS3, TNFSF10
## MIR210HG, CRYAB, FMO2, KRTAP5-8, RNASE7, LEMD1, NDRG1, CALML5, IGFBP5, CA9
## Negative: UBE2T, CENPF, ASPM, KIFC1, H2AFV, GMNN, ESCO2, TUBA1B, HIST1H4C, TMEM106C
## SIVA1, MKI67, NUSAP1, RRM2, STMN1, PSIP1, TPX2, DEK, LBR, KIF23
## ANP32E, UBE2C, TUBB, BIRC5, NASP, AURKB, DHTKD1, KIF4A, PCLAF, ANLN
## PC_ 2
## Positive: NEAT1, IER3, KLF6, MALAT1, NDRG1, AHNAK, CEBPD, SAT1, VEGFA, PNRC1
## LGALS3, TFCP2L1, ELF3, FBXO32, COL6A2, P4HA2, USP53, KCNQ1OT1, OLMALINC, ATF3
## ZFP36L1, RRAGD, ERO1A, BNIP3L, DUSP4, SOX4, C4orf3, BNIP3, ANKRD37, ZFP36L2
## Negative: H2AFZ, UBE2C, TOP2A, PBK, UBE2S, CCNB1, NUSAP1, PTTG1, TUBA1B, MKI67
## CDCA8, HMGB1, CENPA, HMGB2, CENPF, STMN1, BIRC5, HMMR, TPX2, CCNA2
## NUF2, UBE2T, RRM2, CDK1, MAD2L1, CENPW, ASPM, AURKA, AURKB, PLK1
## PC_ 3
## Positive: HMGB2, TOP2A, NUSAP1, DLGAP5, MKI67, CENPA, CCNA2, TPX2, PFKP, NUF2
## NDRG1, KIF4A, HMMR, UBE2C, HIST1H1B, AURKB, CAMK1D, ASPM, CENPE, PIF1
## CENPF, ERO1A, NDC80, CDKN3, KNL1, NEK2, GTSE1, BUB1, CDCA3, PBK
## Negative: SELENBP1, AZGP1, OAZ3, PAK1IP1, NES, IFITM3, LY6E, EEF1E1, SRM, HACD3
## CITED4, FKBP1A, NME1, SORBS2, NFIA, ROPN1, STAC2, NDUFAF4, SDAD1, TSC22D3
## ZG16B, PLA2R1, SDF2L1, NPM3, ATP5PB, E2F1, MTX1, SELENOW, KCNQ1OT1, PPAN
## PC_ 4
## Positive: AQP3, FAM3D, COL9A3, CCL28, KRT14, CDK6, KLK7, TPT1-AS1, SLC34A2, RAB11FIP1
## CXCL17, KLK10, NCCRP1, LBH, KLF5, S100P, INHBA, KLF6, SULF2, ANXA1
## CEBPD, ATF3, BTG2, CEACAM1, KRT16, CALML5, SCGB3A1, ADD3, NR4A1, KLK6
## Negative: EGLN3, PFKP, ANGPTL4, NRN1, CAMK1D, IGFBP5, SLC16A3, CHI3L1, NDRG1, LOX
## P4HA1, SLC2A1, FZD8, CA9, VEGFA, AKR1C2, ERO1A, NCAN, FAM162A, STC1
## EGFR, BNIP3, CSF3R, SCGB2A2, SCGB1D2, AKR1C3, WDR54, COL23A1, CD5L, ALDOC
## PC_ 5
## Positive: MCM4, DTL, HELLS, MCM10, HSD17B7, CLSPN, FAM111B, MCM3, E2F7, ATAD2
## MSH6, MCM6, CCNE1, XRCC2, CCNE2, ESCO2, CDC6, NCOA7, HIST1H1B, LDLR
## ARHGAP10, SMC1A, RRM2, OLMALINC, HIST1H4C, POLR2J3, MMS22L, NR4A2, NEAT1, STAC2
## Negative: ARL6IP1, LGALS1, PTTG1, S100A16, CDC20, CCNB2, S100A10, NMU, CDKN3, CCNB1
## HMGN2, H2AFV, HMGB3, TMSB4X, ANXA3, ANXA2, SNRPG, TUBA1A, JPT1, PSMA7
## KLK5, NQO1, MIEN1, KRT14, TAGLN2, LCN2, LGALS3, RBP1, HIST3H2A, KLK6
# create heatmap
DimHeatmap(whim_filtered, dims = 1:15, cells = 500, balanced = TRUE)
Q9) What are the axes of each heatmap?
Y axis are genes X axis is reads
Q10) How many cells are used in each heatmap?
1
Q11) Describe what happens when the principal component (PC) number gets higher.
retains more biological signal
To visualize single cell data we need to use enough PCs to represent
the variability in the data without adding too much noise or having too
little. An Elbow plot can help with this along with looking at the
DimHeatmap plots. We need to pick a threshold close to the elbow in the
plot. This indicates where not much more information is being
added.
For the plot generated by the command below it looks to be around 5;
however, the default is usually 15, which is a good cutoff for most
datasets.
# draw elbow plot
ElbowPlot(whim_filtered)
#Create a tSNE plot using the first 5 dimensions.
whim_filtered <- RunTSNE(whim_filtered, dims = 1:5)
DimPlot(whim_filtered, reduction = "tsne")
# add additional code here based on Question 12
whim_filtered <- RunTSNE(whim_filtered, dims = 1:4)
DimPlot(whim_filtered, reduction = "tsne")
whim_filtered <- RunTSNE(whim_filtered, dims = 1:1)
DimPlot(whim_filtered, reduction = "tsne")
whim_filtered <- RunTSNE(whim_filtered, dims = 1:6)
DimPlot(whim_filtered, reduction = "tsne")
Q12) In the code block above, copy the 2 lines of code that generated the tSNE plot to answer the following questions. Note I am expecting at least 4 plots.
What happens when you use fewer dimensions?
less dimensionality
How about only the first dimension?
squiggly line
What happens when you use more dimensions?
more dimensionality !
This flat pink plot is pretty boring. Let’s add some color by coloring each cell with the expression level of the 2 top most variable genes you identified in Q8.
# change dimensions back to 1:5
whim_filtered <- RunTSNE(whim_filtered, dims = 1:5)
# create the feature plot
FeaturePlot(whim_filtered, features = c("SCGB2A2", "KRT16"), reduction = "tsne")
You should get 2 plots with the most highly expressed gene grouped into a single location. This could be a cluster of cells with similar gene expression profiles. To find out, let’s cluster the dataset, and then color the tSNE plot by cluster.
# find nearest neighbors and cluster cells
whim_filtered <- FindNeighbors(whim_filtered, dims = 1:5)
## Computing nearest neighbor graph
## Computing SNN
whim_filtered <- FindClusters(whim_filtered, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 1314
## Number of edges: 39301
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8145
## Number of communities: 5
## Elapsed time: 0 seconds
# overlay cluster colors onto dim plot
DimPlot(whim_filtered, reduction = "tsne", group.by = "seurat_clusters")
Q13) Which cluster includes the 2 genes you plotted in Q11?
cluster 4
Now we can explore each gene one at a time, which is not very
efficient.
Next we will run an all-by-all differential expression analysis to
identify the top 10 gene markers for each cluster.
# find marker genes for each cluster
whim_filtered.markers <- FindAllMarkers(whim_filtered, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
# identify the top 20 marker genes for each cluster
whim_filtered.markers %>% group_by(cluster) %>% top_n(n = 20, wt = avg_log2FC) -> top20
# plot marker gene expression for each cluster
DoHeatmap(whim_filtered, features = top20$gene)
## Warning in DoHeatmap(whim_filtered, features = top20$gene): The following
## features were omitted as they were not found in the scale.data slot for the RNA
## assay: VPS13B, RELB, TRMT2B, PPP1R12C, TCF19, EML1, ERCC6L2, FA2H, ETFDH,
## BTBD1, PHKG1, WASHC2A, DONSON, ZMIZ1, PPP2R5D, ANAPC2, MYO1E, ADAM10,
## STX17-AS1, USP54, PTK6, FUT2, HERC6, TYMSOS, CIT, DTD2, SWSAP1, YTHDF3-AS1,
## ARHGAP42, ZNF480, MAP9, METAP1D, TGFBR2, MLYCD, ASB6, ZNF454, TCTA, ENG, PSEN2
Q14) What is the x-axis for in this plot?
Differential expression clusters
Q15) Which 2 clusters have the most distinct expression profile from the rest of the clusters?
cluster 1 and cluster 4
Finally, let’s export your list of differentially expressed genes to
a text file.
First filter base on significant p-value, then save to a file. Column
explanations are as follows:
# filter on adjusted pval
whim_filtered.markers <- whim_filtered.markers[whim_filtered.markers$p_val_adj <= 0.001,]
# save to a CSV file
write.csv(whim_filtered.markers, file="whim_filtered_SignificantMarkerGenes.csv", quote = FALSE)
Either through Athena, or by downloading this file, open the CSV file and answer the following question.
Q16) What is the log2FC of the top most variable gene you identified in Q8?
2.799869
Bonus Q1) Pick one or more genes from the heatmap, any genes you want. Modify the violin plot command from the QC section by replacing the “nFeature_RNA”, “nCount_RNA”, and “percent.mt” with your gene names (delete or add entries as needed). Adjust the “n_col” attribute as needed, and add in the “group_by” statement from step 9 above to group the plots by cluster.
# Bonus Q1 code goes here
Bonus Q2) Plot the tSNE map, only color it by mitochondrial expression. Use the “cols” attribute to change the color scale (hint, look in the documentation for the command to see its usage).
# Bonus Q2 code goes here
Finally, save this file, then knit your HTML report for submission
via GitHub.
Go back to the Word document for github instructions!